feat(ingestion/hex): Allow category filters#16752
feat(ingestion/hex): Allow category filters#16752alisa-aylward-toast wants to merge 12 commits intodatahub-project:masterfrom
Conversation
Bundle ReportChanges will increase total bundle size by 13.65kB (0.06%) ⬆️. This is within the configured threshold ✅ Detailed changes
Affected Assets, Files, and Routes:view changes for bundle: datahub-react-web-esmAssets Changed:
|
Codecov Report✅ All modified and coverable lines are covered by tests. 📢 Thoughts on this report? Let us know! |
|
Linear: ING-2067 Thanks for your contribution! We have created an internal ticket to track this PR. A member of the core DataHub team will be assigned to review it within the next few business days - you will get a follow-up comment once a reviewer is assigned. |
|
Your PR has been assigned to @dineshrathi-dh (dinesh.rathi) for review (ING-2067). |
| if not self.source_config.category_pattern.allowed( | ||
| category.name | ||
| ): | ||
| skip_item = True |
There was a problem hiding this comment.
from the code it seems the requirement is to skip the project_or_component entirely if any of the categories does not match the allowed pattern. Please add more info in the description explaining the expected behavior. Even considering the expected behavior based on code logic, the code looks more complex than needed. this can be simplified as
if project_or_component.categories and any(
not self.source_config.category_pattern.allowed(c.name)
for c in project_or_component.categories
):
continue
There was a problem hiding this comment.
question on how patterns work -- if the no pattern is added for this in the ingestion, will self.source_config.category_pattern.allowed always be true?
There was a problem hiding this comment.
yes. the default is set to allow_all(). .allowed always returns True unless an allow list is specified.
There was a problem hiding this comment.
Also, i would suggest moving the filtering logic to a separate method will be cleaner.
| "categories": [{ | ||
| "name": "Keep_Scratchpad", | ||
| "description": "Intended for broad consumption" | ||
| }], |
There was a problem hiding this comment.
Please add more test cases with mix of allowed and denied categories for more rigorous testing.
| def get_workunits_internal(self) -> Iterable[MetadataWorkUnit]: | ||
| with self.report.new_stage("Fetch Hex assets from Hex API"): | ||
| for project_or_component in self.hex_api.fetch_projects(): | ||
| if project_or_component.categories and any( |
There was a problem hiding this comment.
@alokr-dhub where would you suggest putting this logic in terms of a new function? I ask because having it here pattern matches lines 270 and 278, so I didn't find a natural place to put the function.
i could add it to the HexAPI class if you think that's a good spot for it
There was a problem hiding this comment.
ok. we can keep it here now. ideally all the checks should be in a separate internal method in the hex source class
|
@alisa-aylward-toast LGTM. Please add the test cases. |
|
@alokr-dhub since your first check I did add two more test cases. I added one with two categories -- one that is excluded -- https://github.com/datahub-project/datahub/pull/16752/changes#diff-5c43981e8e278bbb4aec0157f6dc7bce9da1425d7d6e7485ca807a223a499035R101 |
Allowing for filtering on categories in Hex. In the ingestion, users can decide if they want to include/exclude certain Hex categories.